Wave 2 polish: fix shader build, remove hardcoded paths, add CI #1

Merged
Peterc3-dev merged 1 commit into master from wave2-polish on May 30, 2026

@Peterc3-dev
Owner

Summary

Polish pass on the Vulkan-compute PyTorch backend. Focus: a real reproducible-build bug in the shader toolchain, removing hardcoded/leaked developer paths, and adding CI for what can be checked without a GPU.

Changes

Shader build (real bug, fixed + verified)

  • csrc/shaders/compile.sh used glslangValidator -V, which emits SPIR-V 1.0. The 12 subgroup-using quantized matmul kernels (matmul_q4k/q5k/q6k, their _batch variants, and matmul_gpuq4/5/6) use GL_KHR_shader_subgroup reductions that require SPIR-V >= 1.3, so the script could not rebuild them; argmax also uses subgroup ops but is not one of the 12. Switched to --target-env vulkan1.1 (SPIR-V 1.3). Added set -euo pipefail and a non-zero exit if any shader fails.
  • Verified locally with the system glslangValidator: all 38 shaders compile (12 of which failed before).
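
The fix above comes down to two things: target Vulkan 1.1 so glslang emits SPIR-V 1.3, and stop swallowing per-shader failures. The real script is shell (compile.sh); the following is only a Python sketch of the same logic. The function names, the `runner` parameter, and the `.spv` output naming are illustrative assumptions, not code from the repository.

```python
import subprocess
from pathlib import Path


def build_command(src: Path, out: Path) -> list[str]:
    """Return a glslangValidator invocation targeting Vulkan 1.1 (SPIR-V 1.3)."""
    return ["glslangValidator", "-V", "--target-env", "vulkan1.1",
            str(src), "-o", str(out)]


def compile_all(shader_dir: Path, runner=subprocess.run) -> list[Path]:
    """Compile every .comp file in shader_dir and return the sources that failed."""
    failed = []
    for src in sorted(shader_dir.glob("*.comp")):
        out = src.with_suffix(".spv")  # assumed output naming
        result = runner(build_command(src, out))
        if result.returncode != 0:
            # Record the failure instead of ignoring it, then keep going
            # so that one run reports every broken shader.
            failed.append(src)
    return failed


def main(shader_dir: str) -> int:
    """Return a process exit code: 1 if any shader failed, else 0."""
    failures = compile_all(Path(shader_dir))
    for src in failures:
        print(f"FAILED: {src}")
    return 1 if failures else 0
```

A caller would pass `main(...)`'s return value to `sys.exit`, which gives CI the non-zero exit the old script lacked. Injecting `runner` keeps the loop testable on machines without glslangValidator.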

Hardcoded / leaked paths removed (portability)

  • csrc/vulkan_engine.cpp hardcoded /home/raz/projects/torch-vulkan/csrc/shaders/. Removed it; the engine now falls back to the TORCH_VULKAN_SHADER_DIR env var.
  • torch_vulkan/__init__.py now exports the resolved bundled-shader directory into TORCH_VULKAN_SHADER_DIR so the lazily-constructed VulkanEngine finds the same shaders the Kompute context uses.
  • tests/test_algo_cache.py and tests/bench_layer.py hardcoded /home/raz/builds/pytorch-gfx1150 on sys.path. Replaced with an optional TORCH_VULKAN_PYTORCH_PATH env var.
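
The two env-var patterns described above can be sketched as follows. This is a minimal illustration, not the repository's code: the helper names and the injectable `env`/`path` parameters are hypothetical, and how the bundled shader directory is located is left to the caller. Only the variable names TORCH_VULKAN_SHADER_DIR and TORCH_VULKAN_PYTORCH_PATH come from this PR.

```python
import os
import sys
from pathlib import Path


def export_shader_dir(bundled_dir: Path, env=os.environ) -> str:
    """Publish the bundled shader dir unless the user already set one."""
    # setdefault leaves an explicit user override untouched.
    return env.setdefault("TORCH_VULKAN_SHADER_DIR", str(bundled_dir))


def add_optional_pytorch_path(env=os.environ, path=sys.path) -> bool:
    """Prepend a custom PyTorch build to the import path only if requested."""
    custom = env.get("TORCH_VULKAN_PYTORCH_PATH")
    if custom and custom not in path:
        path.insert(0, custom)
        return True
    # Variable unset (or already on the path): use whatever torch is installed.
    return False
```

Because both helpers do nothing when the variables are unset, the tests and the package import cleanly on machines that do not have the original developer's directory layout.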

Tests

  • tests/test_mm.py: the old test_cpu_fallback claimed "relu isn't implemented" — but relu is wired to the Vulkan backend. Split into test_relu (real correctness check vs CPU) and test_unimplemented_op_falls_back_to_cpu (uses sign, which has no Vulkan impl, to actually exercise the boxed CPU fallback).

Lint / cleanup

  • Removed dead imports (numpy, time in persistent_pipeline.py; sys in setup.py) and unused benchmark locals.
  • Aligned the __init__.py module docstring with the README: .to("vulkan") is the supported tensor-creation path; torch.randn(..., device="vulkan") and .vulkan() are only partially wired.
  • README quickstart now compiles shaders and uses cmake -S . -B build (the old cd build referenced a gitignored, non-existent dir).

CI (new)

  • .github/workflows/ci.yml: compiles every shader with --target-env vulkan1.1 (guards the regression above) and runs ruff + py_compile.
  • pyproject.toml: ruff config (E,F,I; E402 ignored because the package intentionally imports torch, then loads _C, then aliases into torch.vulkan).
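
The py_compile part of the CI check can be reproduced locally without a GPU. The sketch below uses only the standard library; the helper name `find_syntax_errors` is hypothetical, and the actual workflow may invoke py_compile differently.

```python
import py_compile
from pathlib import Path


def find_syntax_errors(root: Path) -> list[str]:
    """Byte-compile every .py file under root and return failure messages."""
    errors = []
    for path in sorted(root.rglob("*.py")):
        try:
            # doraise=True turns a syntax error into an exception instead of
            # a message on stderr. Compiled output goes to __pycache__.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            errors.append(f"{path}: {exc.msg}")
    return errors
```

An empty list means every file parsed. This catches syntax errors only; import-time failures (for example a missing `_C` extension) need the real build.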

Verified

  • Shader compilation: all 38 .comp compile with the new flag (ran locally).
  • Ruff: clean (ruff 0.15.15, ran locally on the full tree).
  • py_compile: passes for all Python files.
  • Committed .spv binaries were intentionally left untouched — recompiling them locally produced different bytes (different glslang build) that I can't validate against the target GPU, and the committed ones are the verified artifacts per the README.

UNVERIFIED (no toolchain on this machine)

  • C++ extension build NOT run. cmake is not installed, the Kompute dependency is absent, and there are no Vulkan dev headers or GPU here. The C++ edits are minimal and mechanical (one #include <cstdlib>, a std::getenv fallback block, a comment-only change in torch_vulkan.cpp) but were not compiled.
  • Runtime test suite (pytest tests/) NOT run — requires a Vulkan GPU + Kompute + the custom PyTorch build. No correctness/perf numbers were re-measured; none are claimed here.

TODOs left (not addressed — intent unclear / out of scope)

  • The many "written but not wired" shaders (Q5_K/Q6_K dispatch, rope, rmsnorm, silu_gate, kv-cache attention, etc.) remain roadmap items with no host dispatch.
  • mm_raw (raw VulkanEngine path) is implemented but still not registered to any aten op.

🤖 Generated with Claude Code

- compile.sh: target Vulkan 1.1 (SPIR-V 1.3) so the 12 subgroup-using
  quantized kernels (matmul_q*k*, matmul_gpuq*) actually compile; the default
  `glslangValidator -V` emits SPIR-V 1.0 and failed on them. Added set -euo
  pipefail and a non-zero exit on any failure.
- vulkan_engine.cpp: drop the hardcoded /home/raz/... shader path (leaked a
  developer path and broke on every other machine). Fall back to the
  TORCH_VULKAN_SHADER_DIR env var instead.
- __init__.py: export the resolved bundled-shader dir into
  TORCH_VULKAN_SHADER_DIR so the lazily-constructed VulkanEngine resolves the
  same shaders. Aligned the module docstring with the README (.to("vulkan")
  is the supported path; torch.randn(device=) / .vulkan() are partial).
- tests: removed hardcoded /home/raz/builds/pytorch-gfx1150 sys.path inserts;
  honour TORCH_VULKAN_PYTORCH_PATH instead. Fixed the misleading "relu isn't
  implemented" fallback test (relu IS wired) and split it into a real relu
  correctness test plus a genuine CPU-fallback test using an unimplemented op.
- Removed dead imports (numpy, time, sys) and unused locals flagged by ruff.
- Added pyproject.toml (ruff config) and .github/workflows/ci.yml that
  compiles all shaders and runs ruff + py_compile. The GPU build/test suite is
  not run in CI (no GPU on hosted runners).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Peterc3-dev
Owner Author

Independent verification — verdict: solid (mergeable)

Re-checked every concrete claim on a separate machine (glslangValidator + ruff present; cmake/Vulkan GPU absent, so C++ build/pytest remain unverified — matching the PR's own UNVERIFIED section).

Shader fix — confirmed real and correct

  • Old glslangValidator -V (SPIR-V 1.0) fails on exactly the 12 quantized matmul kernels with 'subgroup op' : requires SPIR-V 1.3. argmax.comp uses subgroup ops too but compiles under the old flag — the PR correctly claimed only 12 failed.
  • New --target-env vulkan1.1 compiles all 38 .comp cleanly.
  • set -euo pipefail + the if ! glslang... guard: verified the script now exits 1 on a deliberately-broken shader (glslangValidator's own non-zero exit propagates). The old script silently swallowed failures.
  • Committed .spv left untouched — re-ran compile, then git checkout restored a clean tree as described.

Paths — confirmed removed

  • No /home/raz strings remain anywhere in shipped code (csrc/, torch_vulkan/, setup.py, tests/). Env-var fallbacks are correct.
  • VulkanEngine shaderDir_ is now empty if TORCH_VULKAN_SHADER_DIR is unset, but that engine is only reachable via mm_raw, which is not registered to any aten op, and __init__.py always sets the env var via os.environ.setdefault at import. No runtime regression for the live op path (which uses VulkanContext, untouched).
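
A check like the "no /home/raz strings remain" claim above is easy to repeat. The following is a sketch of one way to do it, not the command actually used for this verification; the function name and the default suffix list are assumptions.

```python
from pathlib import Path


def find_leaked_paths(root: Path, needle: str = "/home/raz",
                      suffixes=(".py", ".cpp", ".h", ".sh", ".comp")):
    """Return (file, line number) pairs for every line containing needle."""
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # errors="replace" keeps the scan going on files with odd encodings.
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path, lineno))
    return hits
```

Running it over csrc/, torch_vulkan/, and tests/ and asserting an empty result would turn the manual check into a regression guard.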

Tests — correct

  • relu IS registered (m.impl("relu", &relu) + relu.spv), so the old test_cpu_fallback was mislabeled. sign is registered nowhere, and the boxed CPU fallback IS registered (m.fallback(...)), so test_unimplemented_op_falls_back_to_cpu is a valid fallback exercise.

Lint/CI — confirmed

  • ruff 0.15.15 clean; py_compile passes. Removed imports (time, numpy, sys) were genuine F401s on master. CI workflow guards the shader regression and scopes out the GPU build honestly.

Overclaims: none. The body is unusually disciplined about the build/pytest gap — independently confirmed cmake is unavailable here, so those remain legitimately unverified rather than overclaimed.

Recommend merge.

@Peterc3-dev Peterc3-dev merged commit a344d77 into master May 30, 2026
2 checks passed
@Peterc3-dev Peterc3-dev deleted the wave2-polish branch May 30, 2026 01:17